The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
Kaggle is a data science competition platform that shares a lot of datasets. In the past, submitting your results was troublesome, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
Download your kaggle.json API token and place it in the right place (~/.kaggle/). For more detailed information on setting up the Kaggle API, see here and here.
%config Completer.use_jedi = False
from time import time, ctime
nb_start = time()
print("Note Book Start time: ", ctime(nb_start))
!pip install kaggle
!pwd
!mkdir -p ~/.kaggle
!cp /root/shared/Downloads/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions files home-credit-default-risk
If running on Google Colab with Google Drive, run the cells below; otherwise skip them.
#from google.colab import drive
#drive.mount('/content/drive',force_remount=True)
#import os
#os.chdir("/content/drive/My Drive")
#!ls
#import pandas as pd
#data = pd.read_csv('/content/drive/My Drive/AML Project/Data/bureau.csv')
#data.head(5)
#DATA_DIR='/content/drive/My Drive/AML Project/Data/'
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise not obtain loans or would become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euro, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
There are 7 different sources of data:

* application_{train|test}.csv
* bureau.csv
* bureau_balance.csv
* previous_application.csv
* POS_CASH_balance.csv
* credit_card_balance.csv
* installments_payments.csv
DATA_DIR = "/root/shared/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/HCDR_Phase_1_baseline_submission/data" #same level as course repo in the data directory
#DATA_DIR = os.path.join('./ddddd/')
!mkdir -p $DATA_DIR
!ls -l $DATA_DIR
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
%config Completer.use_jedi = False
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
from time import time, ctime
from sklearn.model_selection import ShuffleSplit
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
from scipy import stats
import json
from matplotlib import pyplot
unzippingReq = False
if unzippingReq:  # please modify this code
    zip_file = os.path.join(DATA_DIR, 'home-credit-default-risk.zip')
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(path=DATA_DIR)
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
The application dataset has the most information about the client: Gender, income, family status, education ...
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance",
            "installments_payments", "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
print('\033[1m' + "Size of each dataset: " + '\033[0m', end='\n' * 2)
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]:4}]')
(datasets['application_train'].dtypes).unique()
from IPython.display import display, HTML
pd.set_option("display.max_rows", None, "display.max_columns", None)
# Full stats
def stats_summary1(df, df_name):
    print(df.info(verbose=True, null_counts=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is:")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    display(HTML(np.round(df.describe(), 2).to_html()))
def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nNumber of unique values in each object (categorical) column: \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))
# List the categorical and numerical features of a DataFrame
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and numerical (int + float) features of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")
# List and plot missing data.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', 'Train Missing Count'])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
    print("-----" * 15)
    if len(df.columns) > 35:
        f, ax = plt.subplots(figsize=(8, 15))
    else:
        f, ax = plt.subplots()
    plt.title(f'Percent missing data for {df_name}.', fontsize=10)
    fig = sns.barplot(x=missing_data['Percent'], y=missing_data.index, alpha=0.8)
    plt.xlabel('Percent of missing values', fontsize=10)
    plt.ylabel('Features', fontsize=10)
    return missing_data
# Consolidate all the stats functions.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)
display_stats(datasets['application_train'], 'application_train')
display_feature_info(datasets['application_train'], 'application_train')
datasets["application_train"]['DAYS_EMPLOYED'].describe()
anom_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED']==365243]
norm_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED']!=365243]
print(anom_days_employed.shape)
dr_anom = anom_days_employed['TARGET'].mean()*100
dr_norm = norm_days_employed['TARGET'].mean()*100
print('Default rate (Anomaly): {:.2f}'.format(dr_anom))
print('Default rate (Normal): {:.2f}'.format(dr_norm))
pct_anom_days_employed = (anom_days_employed.shape[0]/datasets["application_train"].shape[0])*100
print(pct_anom_days_employed)
df_app_train=datasets["application_train"].copy()
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
df_app_train['DAYS_EMPLOYED'].replace({365243:np.nan}, inplace=True)
plt.hist(df_app_train['DAYS_EMPLOYED'],edgecolor = 'k', bins = 25)
plt.title('DAYS_EMPLOYED'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count');
The histogram above shows an anomalous spike (365243 days, roughly 1,000 years), so the data is not logical and this feature needs further investigation. Number of days employed would indicate a steady source of income and could be a useful feature for predicting risk.
plt.hist(datasets["application_train"]['OWN_CAR_AGE'],edgecolor = 'k', bins = 25)
plt.title('OWN CAR AGE'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count');
We see that a number of applications (3,339) report cars over 60 years old. This could be a good area to investigate for risk.
display_feature_info(datasets['application_train'], 'application_train')
plt.hist(datasets["application_train"]['DAYS_BIRTH']/-365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"], order = datasets["application_train"]['OCCUPATION_TYPE'].value_counts().index);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
def generic_xy_boxplot(xaxisfeature, yaxisfeature, legendcategory, data, log_scale):
    sns.boxplot(x=xaxisfeature, y=yaxisfeature, hue=legendcategory, data=data)
    plt.title('Boxplot for ' + xaxisfeature + ' with ' + yaxisfeature + ' and ' + legendcategory, fontsize=10)
    if log_scale:
        plt.yscale('log')
        plt.ylabel(f'{yaxisfeature} (log scale)')
    plt.tight_layout()

def box_plot(plots):
    number_of_subplots = len(plots)
    plt.figure(figsize=(20, 8))
    sns.set_style('whitegrid')
    for i, ele in enumerate(plots):
        plt.subplot(1, number_of_subplots, i + 1)
        plt.subplots_adjust(wspace=0.25)
        xaxisfeature, yaxisfeature, legendcategory, data, log_scale = ele
        generic_xy_boxplot(xaxisfeature, yaxisfeature, legendcategory, data, log_scale)
plots=[['NAME_CONTRACT_TYPE','AMT_CREDIT','CODE_GENDER',datasets['application_train'],False]]
box_plot(plots)
Gender does not appear to have a major impact, but the credit amount for cash loans is significantly higher than for revolving loans.
display_stats(datasets['previous_application'], 'previous_application')
display_feature_info(datasets['previous_application'], 'previous_application')
display_stats(datasets['bureau'], 'bureau')
display_feature_info(datasets['bureau'], 'bureau')
display_stats(datasets['bureau_balance'], 'bureau_balance')
display_feature_info(datasets['bureau_balance'], 'bureau_balance')
display_stats(datasets['credit_card_balance'], 'credit_card_balance')
display_feature_info(datasets['credit_card_balance'], 'credit_card_balance')
display_stats(datasets['installments_payments'], 'installments_payments')
display_feature_info(datasets['installments_payments'], 'installments_payments')
display_stats(datasets['POS_CASH_balance'], 'POS_CASH_balance')
display_feature_info(datasets['POS_CASH_balance'], 'POS_CASH_balance')
display_stats(datasets['application_test'], 'application_test')
display_feature_info(datasets['application_test'], 'application_test')
The top 20 correlated features (positive and negative) for the application_train dataset are listed below.
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df = datasets["application_train"].copy()
df2 = df[num_attribs]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').set_precision(2)
The distributions of the top correlated features are plotted below.
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(15,20))
for i, var in enumerate(var_neg_corr):
    plt.subplot(numVar, 4, i + 1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize=10)
    plt.tight_layout()
plt.show()
var_pos_corr = correlations.tail(10).index.values
numVar = var_pos_corr.shape[0]
plt.figure(figsize=(15,20))
for i, var in enumerate(var_pos_corr):
    plt.subplot(numVar, 4, i + 1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize=10)
    plt.tight_layout()
plt.show()
def cat_features_plot(datasets, df_name):
    df = datasets[df_name].copy()
    df['TARGET'].replace(0, "No Default", inplace=True)
    df['TARGET'].replace(1, "Default", inplace=True)
    categorical_col = [col for col in df if df[col].dtype == 'object']
    plot_x = int(len(categorical_col) / 2)
    fig, ax = plt.subplots(plot_x, 2, figsize=(20, 50))
    num = 0
    for i in range(plot_x):
        for j in range(2):
            tst = sns.countplot(x=categorical_col[num], data=df, hue='TARGET', ax=ax[i][j])
            tst.set_title(f"Distribution of the {categorical_col[num]} Variable.")
            tst.set_xticklabels(tst.get_xticklabels(), rotation=90)
            plt.subplots_adjust(hspace=0.45)
            num = num + 1
    plt.tight_layout()
cat_features_plot(datasets, "application_train")
def numerical_features_plot(datasets, df_name):
    df = datasets[df_name].copy()
    numerical_col = [col for col in df if df[col].dtype in ('int64', 'float64')]
    print(numerical_col)
    print(len(numerical_col))
    df2 = df[numerical_col].copy()
    df2.fillna(0, inplace=True)
    # Scatter-matrix, coloured by the target
    grr = pd.plotting.scatter_matrix(df2.loc[:, df2.columns != 'TARGET'],
                                     c=datasets[df_name]['TARGET'], figsize=(15, 15),
                                     marker='.', hist_kwds={'bins': 10}, s=60, alpha=.2)
    # Pair-plot (relabel the target here, after the numeric columns are selected)
    df2['TARGET'].replace(0, "No Default", inplace=True)
    df2['TARGET'].replace(1, "Default", inplace=True)
    num_sns = sns.pairplot(df2, hue="TARGET", markers=["s", "o"])
# numerical_features_plot(datasets, "application_train")
# correlation
# head(10)
# tail(10)
# numerical()
# create the scatter plot and pairwise plot
run = True
if run:
    df_name = 'application_train'
    num_attribs = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
                   'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
    df = datasets[df_name].copy()
    df2 = df[num_attribs].copy()
    df2.fillna(0, inplace=True)
    # Scatter-matrix, coloured by the target
    grr = pd.plotting.scatter_matrix(df2.loc[:, df2.columns != 'TARGET'],
                                     c=datasets[df_name]['TARGET'],
                                     figsize=(15, 15), marker='.',
                                     hist_kwds={'bins': 10}, s=60, alpha=.2)
    # Pair-plot (relabel the target for the legend)
    df2['TARGET'].replace(0, "No Default", inplace=True)
    df2['TARGET'].replace(1, "Default", inplace=True)
    num_sns = sns.pairplot(df2, hue="TARGET", markers=["s", "o"])
Correlation Map of Numerical Variables
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df = datasets["application_train"].copy()
df2 = df[num_attribs]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').set_precision(2)
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(10,40))
for i, var in enumerate(var_neg_corr):
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 1, var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 0, var]
    plt.subplot(numVar, 3, i + 1)
    plt.subplots_adjust(wspace=2)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    plt.ylabel('Density')
    plt.title(var, fontsize=10)
    plt.tight_layout()
plt.show()
var_pos_corr = correlations.tail(10).index.values
numVar = var_pos_corr.shape[0]
plt.figure(figsize=(10,40))
for i, var in enumerate(var_pos_corr):
    if var == 'TARGET':
        continue
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 1, var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 0, var]
    plt.subplot(numVar, 3, i + 1)
    plt.subplots_adjust(wspace=2)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    plt.ylabel('Density')
    plt.title(var, fontsize=10)
    plt.tight_layout()
plt.show()
We plot the KDEs of the features most positively (and most negatively) correlated with the TARGET, to check whether the distributions differ between the default and non-default groups.

If the distributions of a feature are very different for the default and non-default groups, the feature is likely informative and worth keeping. Here, EXT_SOURCE_3 shows the clearest separation between default and no default.
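One way to put a number on the separation visible in a KDE plot is the single-feature ROC AUC: rank applicants by the raw feature value and score the ranking against the target. A value near 0.5 means no separation; the further from 0.5 (in either direction), the more the two distributions differ. A minimal sketch; `single_feature_auc` and the synthetic frame below are illustrative stand-ins, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(df, feature, target="TARGET"):
    """ROC AUC obtained by ranking applicants on the raw feature value.

    0.5 means no separation; distance from 0.5 (either direction)
    quantifies how different the default / no-default distributions are.
    """
    mask = df[feature].notna()  # AUC is undefined for missing values
    return roc_auc_score(df.loc[mask, target], df.loc[mask, feature])

# Synthetic demo: a feature whose distribution shifts with the target,
# mimicking the separation the EXT_SOURCE_3 KDEs suggest.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)
demo = pd.DataFrame({
    "TARGET": y,
    "ext_like": rng.normal(loc=0.4 - 0.1 * y, scale=0.15),
})
print(round(single_feature_auc(demo, "ext_like"), 3))
```

On the real data this could be applied to each of EXT_SOURCE_1/2/3 to rank them by how much separation the KDEs actually represent.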
datasets.keys()
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
datasets["application_test"].shape
datasets["application_train"].shape
Most of the people in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.
appsDF = datasets["previous_application"]
appsDF.shape
len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"]))
print(f"There are {appsDF.shape[0]:,} previous applications")
# How many previous applications are there per applicant?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts > 40])  # more than 40 previous applications
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=100)
plt.xlabel('ID')
plt.ylabel('Counts')
plt.title('Applicants with more than 50 applications')
plt.show()
The plot above shows that there are applicants with more than 50 applications in the dataset.
sum(appsDF['SK_ID_CURR'].value_counts()==1)
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
* Low = <5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, whether in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the application tables using the key SK_ID_CURR (bureau_balance joins through bureau via SK_ID_BUREAU).
Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:
* aggregates of AMT_APPLICATION or AMT_CREDIT (average, min, max, median, etc.)

To build such features, we need to join the application_train data (and also the application_test data) with the previous_application dataset (and the other available datasets).
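The aggregate-then-join step can be sketched as follows: collapse the secondary table to one row per SK_ID_CURR, then left-join onto the application frame. The toy frames and values below are invented for illustration; only the column names follow the competition files:

```python
import pandas as pd

# Toy stand-ins for application_train / previous_application
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "AMT_APPLICATION": [50.0, 70.0, 20.0]})

# 1) Aggregate the secondary table down to one row per applicant
agg = (prev.groupby("SK_ID_CURR")["AMT_APPLICATION"]
           .agg(["mean", "min", "max"])
           .add_prefix("prevApps_AMT_APPLICATION_")
           .reset_index())

# 2) Left-join so applicants with no previous application are kept
#    (their NaNs are later handled by the imputer in the pipeline)
merged = app.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```

A left join (rather than inner) matters here: applicant 3 has no previous applications and would silently disappear from the training set under an inner join.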
When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:
* Merge the secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to partitioning the data (into train, valid, and test) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY? Think about this section and build on it.]
* Merge the secondary tables after partitioning the application_train data (the labeled dataset) and the application_test data (the unlabeled submission dataset), i.e., after producing X_train, y_train, X_valid, etc.

So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using the three logical operators and, or, and not.
Use &, |, and ~. Although Python uses the syntax and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must use the following operators with pandas: & for and, | for or, and ~ for not.
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
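The two selections above use & and ~; the | (or) operator combines conditions the same way. A self-contained toy example (the small frame stands in for appsDF; values are invented):

```python
import pandas as pd

toy = pd.DataFrame({"SK_ID_CURR": [175704, 175704, 200001],
                    "AMT_CREDIT": [1.0, 5000.0, 9000.0]})

# Rows matching EITHER condition: this applicant OR a credit above 8000.
# Note each condition is wrapped in parentheses, which | requires.
sel = toy[(toy["SK_ID_CURR"] == 175704) | (toy["AMT_CREDIT"] > 8000.0)]
print(len(sel))
```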
appsDF.isna().sum()
appsDF.columns
Feature engineering for highly correlated numerical features using the mean, min, max, sum, and count aggregation functions.
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
agg_op_features = {}
cols = []
agg_func_list = ["mean", "min", "max"]
for f in features:  # build agg dictionary
    agg_op_features[f] = agg_func_list
    cols.extend(f"{f}_{func}" for func in agg_func_list)
print(agg_op_features)
print(f"{appsDF[features].describe()}")
print()
# # results = appsDF.groupby('SK_ID_CURR').agg({'AMT_ANNUITY': ['mean', 'min', 'max'],'AMT_APPLICATION': ['mean', 'min', 'max'] })
# result = appsDF.groupby('SK_ID_CURR').agg({features[0]: ['mean', 'min', 'max'],features[1]: ['mean', 'min', 'max'] })
result = appsDF.groupby('SK_ID_CURR').agg(agg_op_features)
result.columns = ["_".join(x) for x in result.columns.ravel()]
result = result.reset_index(level=["SK_ID_CURR"])
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"result.shape: {result.shape}")
result.head(10)
result.isna().sum()
Testing Feature Transformer for selected field
# Create aggregate features (via pipeline)
class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):  # no *args or **kwargs
        self.features = features
        self.agg_op_features = {}
        for f in features:
            self.agg_op_features[f] = ["min", "max", "mean"]

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        result.columns = ["_".join(x) for x in result.columns.ravel()]
        result = result.reset_index(level=["SK_ID_CURR"])
        result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
        return result  # return dataframe with the join key "SK_ID_CURR"
from sklearn.pipeline import make_pipeline

def test_driver_prevAppsFeaturesAggregater(df, features):
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n{df[features][0:5]}")
    test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
    return test_pipeline.fit_transform(df)
# A longer candidate list, kept as a commented-out alternative (note that
# NAME_PAYMENT_TYPE is categorical and would need different aggregation functions):
# features = ['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
#             'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
#             'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
#             'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
#             'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print(f"Test driver: \n{res[0:10]}")
print(f"input[features][0:10]: \n{appsDF[0:10]}")
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
Choosing Highly correlated features from all input datasets
def correlation_files_target(df_name):
    A = datasets["application_train"].copy()
    B = datasets[df_name].copy()
    correlation_matrix = pd.concat([A.TARGET, B], axis=1).corr().filter(B.columns).filter(A.columns, axis=0)
    return correlation_matrix
df_name = "previous_application"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "bureau"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "bureau_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "credit_card_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
#df_name = "installments_payments"
#correlation_matrix = correlation_files_target(df_name)
#print(f"Correlation of the {df_name} against the Target is :")
#correlation_matrix.T.TARGET.sort_values(ascending= False)
#df_name = "POS_CASH_balance"
#correlation_matrix = correlation_files_target(df_name)
#print(f"Correlation of the {df_name} against the Target is :")
#correlation_matrix.T.TARGET.sort_values(ascending= False)
agg_funcs = ['min', 'max', 'mean', 'count', 'sum']
prevApps = datasets['previous_application']
prevApps_features = ['AMT_ANNUITY', 'AMT_APPLICATION']
bureau = datasets['bureau']
bureau_features = ['AMT_ANNUITY', 'AMT_CREDIT_SUM']
# bureau_funcs = ['min', 'max', 'mean', 'count', 'sum']
bureau_bal = datasets['bureau_balance']
bureau_bal_features = ['MONTHS_BALANCE']
cc_bal = datasets['credit_card_balance']
cc_bal_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM']
installments_pmnts = datasets['installments_payments']
installments_pmnts_features = ['AMT_INSTALMENT', 'AMT_PAYMENT']
# Pipelines
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
class FeaturesAggregator(BaseEstimator, TransformerMixin):
    def __init__(self, file_name=None, features=None, funcs=None):  # no *args or **kwargs
        self.file_name = file_name
        self.features = features
        self.funcs = funcs
        self.agg_op_features = {}
        for f in self.features:
            temp = {f"{file_name}_{f}_{func}": func for func in self.funcs}
            self.agg_op_features[f] = [(k, v) for k, v in temp.items()]
        print(self.agg_op_features)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=["SK_ID_CURR"])
        return result  # return dataframe with the join key "SK_ID_CURR"
class engineer_features(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # FROM APPLICATION
        # ADD INCOME CREDIT PERCENTAGE
        X['ef_INCOME_CREDIT_PERCENT'] = (
            X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
        # ADD INCOME PER FAMILY MEMBER
        X['ef_FAM_MEMBER_INCOME'] = (
            X.AMT_INCOME_TOTAL / X.CNT_FAM_MEMBERS).replace(np.inf, 0)
        # ADD ANNUITY AS PERCENTAGE OF ANNUAL INCOME
        X['ef_ANN_INCOME_PERCENT'] = (
            X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
        # FROM MERGED PREVIOUS APPLICATION
        # ADD PREVIOUS APPLICATION RANGE
        # X['ef_prevApps_AMT_APPLICATION_RANGE'] = (
        #     X.prevApps_AMT_ANNUITY_max - X.prevApps_AMT_ANNUITY_min).replace(np.inf, 0)
        # FROM MERGED BUREAU
        # ADD BUREAU CREDIT RANGE
        # X['ef_bureau_AMT_CREDIT_RANGE'] = (
        #     X.bureau_AMT_CREDIT_SUM_max - X.bureau_AMT_CREDIT_SUM_min).replace(np.inf, 0)
        return X
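The ratio features above can turn into `inf` whenever a denominator is zero; a tiny sketch with made-up values shows why the `.replace(np.inf, 0)` guard is there:

```python
import numpy as np
import pandas as pd

# Toy frame; the second row has AMT_CREDIT == 0, so the ratio would be inf
toy = pd.DataFrame({"AMT_INCOME_TOTAL": [90000.0, 120000.0],
                    "AMT_CREDIT": [450000.0, 0.0]})

ratio = (toy.AMT_INCOME_TOTAL / toy.AMT_CREDIT).replace(np.inf, 0)
print(ratio.tolist())  # [0.2, 0.0]
```

Without the guard, downstream scalers and models would choke on the infinite value.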
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
prevApps_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('prevApps_aggregator', FeaturesAggregator('prevApps', prevApps_features, agg_funcs)), # Aggregate across old and new features
])
bureau_feature_pipeline = Pipeline([
# ('bureau_add_features1', bureau_add_features1()), # add some new features
# ('bureau_add_features2', bureau_add_features2()), # add some new features
('bureau_aggregator', FeaturesAggregator('bureau', bureau_features, agg_funcs)), # Aggregate across old and new features
])
bureau_bal_features_pipeline = Pipeline([
# ('bureau_add_features1', bureau_add_features1()), # add some new features
# ('bureau_add_features2', bureau_add_features2()), # add some new features
('bureau_bal_aggregator', FeaturesAggregator('bureau_balance', bureau_bal_features , agg_funcs)), # Aggregate across old and new features
])
cc_bal_features_pipeline = Pipeline([
# ('bureau_add_features1', bureau_add_features1()), # add some new features
# ('bureau_add_features2', bureau_add_features2()), # add some new features
('cc_bal_aggregator', FeaturesAggregator('credit_card_balance', cc_bal_features , agg_funcs)), # Aggregate across old and new features
])
installments_pmnts_features_pipeline = Pipeline([
    ('installments_pmnts_features_aggregator', FeaturesAggregator('installments_payments', installments_pmnts_features, agg_funcs)),  # Aggregate across old and new features
])
# Feature engineering pipeline for application_train
appln_feature_pipeline = Pipeline([
('engineer_features', engineer_features()), # add some new features
])
appsTrainDF = datasets['application_train']
prevAppsDF = datasets["previous_application"] #prev app
bureauDF = datasets["bureau"] #bureau app
bureaubalDF = datasets['bureau_balance']
ccbalDF = datasets["credit_card_balance"] #prev app
installmentspaymentsDF = datasets["installments_payments"] #bureau app
appsTrainDF = appln_feature_pipeline.fit_transform(appsTrainDF)
prevApps_aggregated = prevApps_feature_pipeline.fit_transform(prevAppsDF)
bureau_aggregated = bureau_feature_pipeline.fit_transform(bureauDF)
# bureaubal_aggregated = bureau_bal_features_pipeline.fit_transform(bureaubalDF)
ccblance_aggregated = cc_bal_features_pipeline.fit_transform(ccbalDF)
installments_pmnts_aggregated = installments_pmnts_features_pipeline.fit_transform(installmentspaymentsDF)
installments_pmnts_aggregated.head()
datasets.keys()
merge_all_data = True
if merge_all_data:
    prevApps_aggregated = prevApps_feature_pipeline.transform(prevAppsDF)
# merge primary table and secondary tables using features based on meta data and aggregate stats
if merge_all_data:
    appsTrainDF = appsTrainDF.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
    appsTrainDF = appsTrainDF.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
    appsTrainDF = appsTrainDF.merge(ccblance_aggregated, how='left', on="SK_ID_CURR")
    appsTrainDF = appsTrainDF.merge(installments_pmnts_aggregated, how='left', on="SK_ID_CURR")
# Features have been increased from 122 to 170.
print(appsTrainDF.shape)
appsTrainDF.head(3)
X_kaggle_test= datasets["application_test"]
X_kaggle_test = appln_feature_pipeline.fit_transform(X_kaggle_test)
merge_all_data = True
if merge_all_data:
    X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
    X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
    X_kaggle_test = X_kaggle_test.merge(ccblance_aggregated, how='left', on="SK_ID_CURR")
    X_kaggle_test = X_kaggle_test.merge(installments_pmnts_aggregated, how='left', on="SK_ID_CURR")
print(X_kaggle_test.shape)
X_kaggle_test.head(3)
print(appsTrainDF.shape)
appsTrainDF.head()
appsTrainDF[['ef_INCOME_CREDIT_PERCENT', 'ef_FAM_MEMBER_INCOME', 'ef_ANN_INCOME_PERCENT']]
Deductions from the dtypes of appsTrainDF:
appsTrainDF.dtypes.value_counts()
start = time()
correlation_with_all_features = appsTrainDF.corr()
end = time()
print("Time taken for correlation: ", round(end - start, 2), "seconds")
print()
correlation_with_all_features['TARGET'].sort_values()
# correlation_with_all_features.reset_index(inplace= True)
len(correlation_with_all_features.index)
# set this value to choose the number of positive and negative correlated features
n_val = 15
print("---"*15)
print("---"*15)
print(" Total correlation of all the features. " )
print("---"*15)
print("---"*15)
print(f"Top {n_val} negatively correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).head(n_val))
print()
print()
print(f"Top {n_val} positively correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).tail(n_val))
correlation_with_all_features.TARGET.sort_values(ascending = True)[-n_val:]
tf_apps_train_final = []
featureslist1 = correlation_with_all_features.TARGET.sort_values(ascending = True)[:n_val].index.tolist()
featureslist2 = correlation_with_all_features.TARGET.sort_values(ascending = True)[-n_val:].index.tolist()
tf_apps_train_final = featureslist1 + featureslist2
tf_apps_train_final.remove('TARGET')
print(len(tf_apps_train_final))
display((tf_apps_train_final))
appsTrainDF.dtypes.unique()
for idx in tf_apps_train_final:
    print(f"{idx:50} {appsTrainDF[idx].dtypes}")
modeling_num_attrib = []
modeling_cat_attrib = []
for idx in tf_apps_train_final:
    if appsTrainDF[idx].dtypes in ['int64', 'float64']:
        modeling_num_attrib.append(idx)
    else:
        modeling_cat_attrib.append(idx)
print(len(modeling_num_attrib))
print(len(modeling_cat_attrib))
# Convert categorical features to numerical approximations (via pipeline)
# NOTE: this transformer references claims-data columns (PayDelay, DSFS, CharlsonIndex,
# LengthOfStay) that do not exist in the HCDR tables; it is not used in the pipelines below.
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
                  '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
from sklearn.base import BaseEstimator, TransformerMixin
import re
# Binary-encodes OCCUPATION_TYPE: higher-skill occupations -> 1.0, all others -> 0.0
# (Much more could be done with this feature, e.g. one-hot encoding each occupation.)
class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
    def __init__(self, features="OCCUPATION_TYPE"):  # no *args or **kwargs
        self.features = features
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X, columns=self.features)
        df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)
        # df.drop(self.features, axis=1, inplace=True)
        return np.array(df.values)  # return a NumPy array to observe the pipeline protocol
from sklearn.pipeline import make_pipeline
features = ["OCCUPATION_TYPE"]
def test_driver_prep_OCCUPATION_TYPE():
    print(f"X_train.shape: {X_train.shape}\n")
    print(f"X_train['name'][0:5]: \n{X_train[features][0:5]}")
    test_pipeline = make_pipeline(prep_OCCUPATION_TYPE(features))
    return test_pipeline.fit_transform(X_train)
#x = test_driver_prep_OCCUPATION_TYPE()
#print(f"Test driver: \n{test_driver_prep_OCCUPATION_TYPE()[0:10, :]}")
#print(f"X_train['name'][0:10]: \n{X_train[features][0:10]}")
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
#train_dataset = datasets["application_train"]
train_dataset=appsTrainDF
class_labels = ["No Default","Default"]
# Create a class to select numerical or categorical columns since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
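Note that newer scikit-learn versions ship `ColumnTransformer`, which selects DataFrame columns natively and covers the same need as this selector; a minimal sketch of the equivalent selection on a made-up two-column frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical mini-frame with one numeric and one categorical column
frame = pd.DataFrame({"AMT_CREDIT": [1.0, 2.0], "CODE_GENDER": ["F", "M"]})

# "passthrough" keeps the named columns unchanged, as DataFrameSelector does
ct = ColumnTransformer([("num", "passthrough", ["AMT_CREDIT"])])
out = ct.fit_transform(frame)
print(out)  # [[1.] [2.]]
```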
# Identify the numeric features we wish to consider.
#num_attribs = [
# 'AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
# 'EXT_SOURCE_2','EXT_SOURCE_3']
num_attribs = [
'AMT_INCOME_TOTAL',
'AMT_CREDIT',
'EXT_SOURCE_3',
'EXT_SOURCE_2',
'EXT_SOURCE_1',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'FLOORSMAX_AVG',
'FLOORSMAX_MEDI',
'FLOORSMAX_MODE',
'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE',
'ELEVATORS_AVG',
'REG_CITY_NOT_LIVE_CITY',
'FLAG_EMP_PHONE',
'REG_CITY_NOT_WORK_CITY',
'DAYS_ID_PUBLISH',
'DAYS_LAST_PHONE_CHANGE',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY',
## Highly correlated previous applications
'prevApps_AMT_ANNUITY_mean',
## Highly correlated Credit card balance features
'credit_card_balance_MONTHS_BALANCE_count',
'credit_card_balance_AMT_BALANCE_count',
'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_count',
'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_sum',
'credit_card_balance_MONTHS_BALANCE_sum',
'credit_card_balance_MONTHS_BALANCE_min',
'credit_card_balance_MONTHS_BALANCE_mean',
'credit_card_balance_AMT_BALANCE_min',
'credit_card_balance_AMT_BALANCE_max',
'credit_card_balance_AMT_BALANCE_mean'
]
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler()),
])
Train, validation and test sets (and the leakage problem we have mentioned previously):
Let's look at a small use case that shows how to deal with this:
Fitting an encoder on the training set and then transforming a test set that contains new, previously unseen unique values raises a ValueError, because the encoder doesn't know how to handle those values. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns. This problem can be solved with the handle_unknown='ignore' option of the OneHotEncoder, which, as the name suggests, ignores previously unseen values when transforming the test set.
Here is an example of that in action:
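A minimal sketch of that behavior with made-up category values: with handle_unknown='ignore', a category seen only at test time encodes as an all-zero row instead of raising a ValueError.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([["Cash loans"], ["Revolving loans"]])
test_cats = np.array([["Cash loans"], ["XNA"]])  # "XNA" never seen in training

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_cats)
encoded = ohe.transform(test_cats).toarray()
print(encoded)
# The unseen "XNA" row encodes as all zeros, and the column count matches training
```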
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
With FeatureUnion, we combine the numerical and categorical pipelines into a single data preparation pipeline:
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
# appending the features found from the correlation matrix to manually selected features.
#for item in modeling_num_attrib:
# if item not in num_attribs:
# num_attribs.append(item)
#
#for item in modeling_cat_attrib:
# if item not in cat_attribs:
# cat_attribs.append(item)
#num_attribs.append(modeling_num_attrib)
#cat_attribs.append(modeling_cat_attrib)
#num_attribs = list(dict.fromkeys(num_attribs))
#cat_attribs = list(dict.fromkeys(cat_attribs))
selected_features = num_attribs + cat_attribs
tot_features = f"{len(selected_features)}: Num:{len(num_attribs)}, Cat:{len(cat_attribs)}"
#Total Feature selected for processing
tot_features
Since HCDR is a classification task, we are going to use the following metrics to measure model performance:
Accuracy describes the fraction of correctly classified samples. In scikit-learn it can also be configured to return the raw number of correct samples. Accuracy is the default scoring method for both logistic regression and k-nearest neighbors in scikit-learn.
The precision is the ratio of true positives over the total number of predicted positives.
The recall is the ratio of true positives over the sum of true positives and false negatives. Recall assesses the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.
The F1 score is a weighted average of precision and recall with a value between 0 and 1, with 1 being the best value; here precision and recall contribute equally.
The confusion matrix, in this case for binary classification, is a 2x2 matrix that contains the counts of true positives, false positives, true negatives, and false negatives.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
▪ True Positive Rate
▪ False Positive Rate
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1).
AUC is desirable for the following two reasons: it is scale-invariant, measuring how well predictions are ranked rather than their absolute values, and it is classification-threshold-invariant, measuring the quality of the model's predictions irrespective of the classification threshold chosen.
CXE (cross-entropy) measures the performance of a classification model whose output is a probability value between 0 and 1. CXE increases as the predicted probability diverges from the actual label. We therefore choose parameters that minimize the binary CXE loss function.
The log loss formula for the binary case is as follows:
$$ -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\log\left(p_i\right)+\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $$
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
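As a quick numeric check of the binary log-loss formula, using two made-up predictions:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0]
p = [0.9, 0.2]  # predicted P(default) for each sample
# Plug directly into the formula: -(1/2) * (log(0.9) + log(1 - 0.2))
manual = -(np.log(0.9) + np.log(1 - 0.2)) / 2
print(round(manual, 4))                # 0.1643
print(round(log_loss(y_true, p), 4))   # 0.1643, matching the manual computation
```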
We will compare the classifiers against the untuned baseline model by conducting a two-tailed hypothesis test.
Null hypothesis, H0: there is no significant difference between the two machine learning pipelines. Alternate hypothesis, HA: the two machine learning pipelines are different. A p-value less than or equal to the significance level is considered statistically significant.
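A sketch of this paired test on hypothetical per-fold cross-validation accuracies (the fold scores below are made up; `scipy.stats.ttest_rel` is the same call used in the model-comparison loop):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold CV accuracies for a baseline and a tuned pipeline
baseline_scores = np.array([0.68, 0.69, 0.67, 0.68, 0.70])
tuned_scores    = np.array([0.70, 0.72, 0.68, 0.71, 0.72])

t_stat, p_value = stats.ttest_rel(baseline_scores, tuned_scores)
significant = p_value <= 0.05  # reject H0 at the 5% significance level
print(t_stat, p_value, significant)
```

Here the tuned pipeline consistently beats the baseline across folds, so the paired differences are all negative and the test rejects H0.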
def confusion_matrix_def(model, X_train, y_train, X_test, y_test):
    # Prediction
    preds_test = model.predict(X_test)
    preds_train = model.predict(X_train)
    cm_train = confusion_matrix(y_train, preds_train).astype(np.float32)
    cm_train /= cm_train.sum(axis=1)[:, np.newaxis]  # normalize each true-class row
    cm_test = confusion_matrix(y_test, preds_test).astype(np.float32)
    cm_test /= cm_test.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize=(20, 8))
    plt.subplot(121)
    g = sns.heatmap(cm_train, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Train", fontsize=14)
    plt.subplot(122)
    g = sns.heatmap(cm_test, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Test", fontsize=14);
# Subsample: split the dataset into `splits` chunks and feed the pipeline only the first,
# i.e. (1 / splits) of the rows
splits = 3
# Train Test split percentage
subsample_rate = 0.3
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test= X_kaggle_test[selected_features]
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
Helper to convert a fraction to a percentage:
def pct(x):
return round(100*x,3)
Define dataframe with all metrics included
#del expLog
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test Acc",
                                   "Train AUC",
                                   "Valid AUC",
                                   "Test AUC",
                                   "Train F1 Score",
                                   "Test F1 Score",
                                   "Train Log Loss",
                                   "Test Log Loss",
                                   "P Score",
                                   "Train Time",
                                   "Test Time",
                                   "Description"
                                   ])
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression())
])
Split the training data into 15 shuffled folds to perform cross-validation:
cvSplits = ShuffleSplit(n_splits=15, test_size=0.3, random_state=0)
X_train.head(5)
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Set up cross validation scores
logit_scores = cross_val_score(full_pipeline_with_predictor,X_train , y_train,cv=cvSplits)
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)
# Time and score test predictions
start = time()
logit_score_test = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[logit_score_train,
pct(accuracy_score(y_valid, model.predict(X_valid))),
pct(accuracy_score(y_test, model.predict(X_test))),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
f1_score(y_train, model.predict(X_train)),
f1_score(y_test, model.predict(X_test)),
log_loss(y_train, model.predict(X_train)),
log_loss(y_test, model.predict(X_test)),0 ],4)) \
+ [train_time,test_time] + [f"Imbalanced Logistic reg features {tot_features} with 20% training data"]
expLog
# Create confusion matrix for baseline model
confusion_matrix_def(model,X_train,y_train,X_test,y_test)
To get a baseline, we use a subset of the features after preprocessing them through the pipeline. The baseline model is a logistic regression. Since the 'No Default' and 'Default' target classes are imbalanced in the training set, we resample the minority class ('Default', target value 1) to balance the input dataset.
# Train Test split percentage
#subsample_rate = 0.3
#finaldf = train_dataset
#X_train = finaldf[0][selected_features]
#y_train = finaldf[0]['TARGET']
#X_train = finaldf[selected_features]
#y_train = finaldf['TARGET']
#X_kaggle_test= datasets["application_test"][selected_features]
## split part of data
#X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
# test_size=subsample_rate, random_state=42)
#X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
# Bincount shows the imbalanced data in Target default and no default class
np.bincount(y_train)
Resampling should be performed only on the training dataset, to avoid overfitting and data leakage.
# concatenate our training data back together
train_data = pd.concat([X_train, y_train], axis=1)
train_data.head()
After resampling, both default and non-default classes are balanced
# separate minority and majority classes
no_default_data = train_data[train_data.TARGET==0]
default_data = train_data[train_data.TARGET==1]
# sample minority
default_sampled_data = resample(default_data,
replace=True, # sample with replacement
n_samples=len(no_default_data), # match number in majority class
random_state=42) # reproducible
# combine majority and upsampled minority
train_data = pd.concat([no_default_data, default_sampled_data])
train_data.TARGET.value_counts()
y_train = train_data['TARGET']
X_train = train_data[selected_features]
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression())
])
Split the training data into 15 shuffled folds to perform cross-validation:
cvSplits = ShuffleSplit(n_splits=15, test_size=0.3, random_state=0)
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Set up cross validation scores
logit_scores = cross_val_score(full_pipeline_with_predictor,X_train , y_train,cv=cvSplits)
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)
# Time and score test predictions
start = time()
logit_score_test = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)
Accuracy, AUC, F1 score, and log loss are used to measure the baseline model:
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[logit_score_train,
pct(accuracy_score(y_valid, model.predict(X_valid))),
pct(accuracy_score(y_test, model.predict(X_test))),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
f1_score(y_train, model.predict(X_train)),
f1_score(y_test, model.predict(X_test)),
log_loss(y_train, model.predict(X_train)),
log_loss(y_test, model.predict(X_test)),0 ],4)) \
+ [train_time,test_time] + [f"Untuned Balanced Logistic reg features {tot_features} with 30% training data"]
expLog
# Create confusion matrix for baseline model
confusion_matrix_def(model,X_train,y_train,X_test,y_test)
Various classification algorithms were compared against the baseline; the following metrics were used to find the best model.
classifiers = [
('Logistic Regression', LogisticRegression(solver='saga',random_state=42))]
# ,
# ('K-Nearest Neighbors', KNeighborsClassifier()),
# ('Naive Bayes', GaussianNB()),
# ('Support Vector', SVC(random_state=42))]
# ('Stochastic GD', SGDClassifier(random_state=42)),
# ('RandomForest', RandomForestClassifier(random_state=42)),
# ]
# Arrange grid search parameters for each classifier
params_grid = {
'Logistic Regression': {
'penalty': ('l1', 'l2'),
'tol': (0.0001, 0.00001, 0.0000001),
'C': (10, 1, 0.1, 0.01),
}
# ,
# 'K-Nearest Neighbors': {
# 'n_neighbors': (5, 7),
# #'n_neighbors': (3, 5, 7, 8, 11),
# 'p': (1,2),
# }
# ,
# 'Naive Bayes': {},
# 'Support Vector' : {
# 'kernel': ('rbf', 'poly'),
# 'degree': (1, 2, 3, 4, 5),
# 'C': (10, 1, 0.1, 0.01),
# }
# ,
# 'Stochastic GD': {
# 'loss': ('hinge', 'perceptron', 'log'),
# 'penalty': ('l1', 'l2', 'elasticnet'),
# 'tol': (0.0001, 0.00001, 0.0000001),
# 'alpha': (0.1, 0.01, 0.001, 0.0001),
# },
# 'RandomForest': {
# 'max_depth': [9, 15, 22, 26, 30],
# 'max_features': [1, 3, 5],
# 'min_samples_split': [5, 10, 15],
# 'min_samples_leaf': [3, 5, 10],
# 'bootstrap': [False],
# 'n_estimators':[20, 80, 150, 200, 300]},
}
results = [logit_scores]
names = ['Baseline LR']
for (name, classifier) in classifiers:
    # Print classifier and parameters
    print('****** START', name, '*****')
    parameters = params_grid[name]
    print("Parameters:")
    for p in sorted(parameters.keys()):
        print("\t" + str(p) + ": " + str(parameters[p]))
    # generate the pipeline
    full_pipeline_with_predictor = Pipeline([
        ("preparation", data_prep_pipeline),
        ("predictor", classifier)
    ])
    # Execute the grid search
    params = {}
    for p in parameters.keys():
        pipe_key = 'predictor__' + str(p)
        params[pipe_key] = parameters[p]
    grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=5,
                               n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    # Best estimator score
    best_train = pct(grid_search.best_score_)
    # Best estimator fitting time
    start = time()
    model = grid_search.best_estimator_.fit(X_train, y_train)
    train_time = round(time() - start, 4)
    # Best estimator prediction time
    start = time()
    best_test_accuracy = pct(grid_search.best_estimator_.score(X_test, y_test))
    test_time = round(time() - start, 4)
    # Best train scores
    best_train_scores = cross_val_score(grid_search.best_estimator_, X_train, y_train, cv=cvSplits)
    best_train_accuracy = pct(best_train_scores.mean())
    results.append(best_train_scores)
    names.append(name)
    # Conduct t-test with baseline logit (control) and best estimator (experiment)
    (t_stat, p_value) = stats.ttest_rel(logit_scores, best_train_scores)
    # Create confusion matrix for the best model
    confusion_matrix_def(model, X_train, y_train, X_test, y_test)
    # Collect the best parameters found by the grid search
    print("Best Parameters:")
    best_parameters = grid_search.best_estimator_.get_params()
    param_dump = []
    for param_name in sorted(params.keys()):
        param_dump.append((param_name, best_parameters[param_name]))
        print("\t" + str(param_name) + ": " + str(best_parameters[param_name]))
    print("****** FINISH", name, " *****")
    print("")
    # Record the results
    exp_name = name
    expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
        [best_train_accuracy,
         pct(accuracy_score(y_valid, model.predict(X_valid))),
         pct(accuracy_score(y_test, model.predict(X_test))),
         roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
         roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
         roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
         f1_score(y_train, model.predict(X_train)),
         f1_score(y_test, model.predict(X_test)),
         log_loss(y_train, model.predict(X_train)),
         log_loss(y_test, model.predict(X_test)),
         p_value
         ], 4)) + [train_time, test_time] \
        + [json.dumps(param_dump)]
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.grid()
pyplot.show()
pd.set_option('display.max_colwidth', None)
expLog
%%time
np.random.seed(42)
final_pipeline = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression(solver='saga', n_jobs=-1, random_state=42,
penalty='l1',
tol=.0001,
C=10
))
])
final_X_train = finaldf[0][selected_features]
final_y_train = finaldf[0]['TARGET']
start = time()
final_pipeline.fit(final_X_train, final_y_train)
train_time = round(time() - start, 4)
exp_name = "Best estimator with full train data"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
    [pct(accuracy_score(final_y_train, final_pipeline.predict(final_X_train))),
     0,
     0,
     roc_auc_score(final_y_train, final_pipeline.predict_proba(final_X_train)[:, 1]),
     0,
     0,
     f1_score(final_y_train, final_pipeline.predict(final_X_train)),
     0,
     log_loss(final_y_train, final_pipeline.predict(final_X_train)),
     0, 0], 4)) + [train_time, 0] \
    + [json.dumps(param_dump)]
expLog
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
test_class_scores = final_pipeline.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']]
submit_df['TARGET'] = test_class_scores
submit_df.head()
submit_df.to_csv("submission.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission"
The goal of this project is to build a machine learning model to predict customer behavior on repayment of a loan.
In Phase 1 of this project we created an outline with basic EDA on all datasets, a baseline pipeline, and selected metrics. Detailed statistical analysis of the categorical and numeric features, along with visual exploration of the features, fed into the baseline pipeline model. Problems we always look out for when doing EDA are anomalies, missing data, and imbalanced data. Further feature engineering of the highly correlated features helped us build a better baseline.
Our Phase 1 results show that the difference between the imbalanced untuned, balanced untuned, and tuned algorithms is statistically significant, and that tuning increases training accuracy on the balanced dataset by only 0.18%, for a final training accuracy of 68.52% and training AUC of 74.3%.
Our ROC AUC score for the Kaggle submission was 0.73253.
Home Credit is an international non-bank financial institution that primarily focuses on lending to people regardless of their credit history. Home Credit Group aims to provide a positive borrowing experience to customers who do not rely on traditional banking sources. To that end, Home Credit Group published a dataset on the Kaggle website with the objective of identifying and solving unfair loan rejection.
The goal of this project is to build a machine learning model to predict customer behavior on repayment of a loan. Our task is to create a pipeline that builds a baseline machine learning model using a logistic regression classifier. The final model will be evaluated with various performance metrics in order to build a better model. Businesses will be able to use the output of the model to identify whether a loan is at risk of default. The model will help ensure that clients capable of repayment are not rejected, and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
The results of our machine learning pipelines will be measured using the following metrics:
The pipeline results will be logged, compared, and ranked using the appropriate measurements. The most efficient pipeline will be submitted to the HCDR Kaggle Competition.
Workflow
For this project, we are following the proposed workflow as mentioned in Phase-0 of this project.
Overview
The full dataset consists of 7 tables: 1 primary table and 6 secondary tables.
Bureau
This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We could join it with primary table by using current application ID, SK_ID_CURR.
The number of variables is 17. The number of data entries is 1,716,428.
Bureau Balance
This table includes the monthly balance for a previous credit at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We could join it with bureau table by using bureau's ID, SK_ID_BUREAU.
The number of variables is 3. The number of data entries is 27,299,925.
Previous Application
This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We could join it with primary table by using current application ID, SK_ID_CURR.
There are four types of contracts:
a. Consumer loan (POS: a credit limit given to buy consumer goods)
b. Cash loan (the client is given cash)
c. Revolving loan (credit)
d. XNA (contract type without values)
There are 37 variables and 1,670,214 data entries.
POS CASH Balance
This table includes a monthly balance snapshot of a previous point-of-sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We join it with the Previous Application table using the previous application ID, SK_ID_PREV, and then with the primary table using the current application ID, SK_ID_CURR.
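The two-step join described above (SK_ID_PREV first, then SK_ID_CURR) can be sketched as follows. The miniature frames are hypothetical; only the key columns reflect the real schema.

```python
import pandas as pd

# Hypothetical minimal frames; real tables have many more columns.
prev_app = pd.DataFrame({"SK_ID_PREV": [2001, 2002],
                         "SK_ID_CURR": [1001, 1002]})
pos_cash = pd.DataFrame({"SK_ID_PREV": [2001, 2001, 2002],
                         "MONTHS_BALANCE": [-3, -2, -1],
                         "CNT_INSTALMENT": [12, 12, 24]})
app = pd.DataFrame({"SK_ID_CURR": [1001, 1002],
                    "TARGET": [0, 1]})

# Step 1: attach SK_ID_CURR to each monthly snapshot via previous_application.
step1 = pos_cash.merge(prev_app, on="SK_ID_PREV", how="left")
# Step 2: join the snapshots onto the primary table.
step2 = step1.merge(app, on="SK_ID_CURR", how="left")
print(step2[["SK_ID_PREV", "SK_ID_CURR", "TARGET"]])
```

The same two-step pattern applies to Credit Card Balance and Installments Payments, which also link to the primary table through SK_ID_PREV.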
There are 8 variables and 10,001,358 data entries.
Credit Card Balance
This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table. We can join it with the Previous Application table using the previous application ID, SK_ID_PREV, and then with the primary table using the current application ID, SK_ID_CURR.
There are 23 variables and 3,840,312 data entries.
Installments Payments
This table includes previous repayments made, or not made, by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We join it with the Previous Application table using the previous application ID, SK_ID_PREV, and then with the primary table using the current application ID, SK_ID_CURR.
There are 8 variables and 13,605,401 data entries.
Exploratory Data Analysis is valuable to this project because it increases our confidence that subsequent results will be valid, accurately interpreted, and applicable to the proposed solution.
In phase 1 of this project, this step involves examining the summary statistics for each individual table, focusing on missing data, distributions, and central tendencies such as the mean, median, count, min, max, and interquartile range.
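The summary statistics and missing-data checks above can be sketched in pandas. The toy frame and its column names are illustrative assumptions, not the real tables.

```python
import pandas as pd
import numpy as np

# Hypothetical sample with gaps, standing in for any of the HCDR tables.
df = pd.DataFrame({"AMT_INCOME_TOTAL": [202500.0, np.nan, 67500.0],
                   "CODE_GENDER": ["M", "F", None]})

# Central tendencies and quartiles for numeric columns.
print(df.describe())

# Missing-data summary: count and percentage per column.
missing = (df.isnull().sum().to_frame("n_missing")
             .assign(pct=lambda m: 100 * m["n_missing"] / len(df)))
print(missing)
```

Running `describe()` per table and sorting the missing-value percentages quickly surfaces columns that need imputation or exclusion.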
Categorical and numerical features were examined to identify anomalies in the data. Specific features were chosen for visualization based on their correlation and distribution. Density plots of the most highly correlated features were used to compare their distributions against the target.
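Ranking features by their correlation with the target, as done above, can be sketched with synthetic data. The feature names and the generated values are entirely hypothetical; only the ranking technique is illustrated.

```python
import pandas as pd
import numpy as np

# Synthetic data: one feature tied to the target, one pure noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"TARGET": rng.integers(0, 2, 200).astype(float)})
df["EXT_SOURCE_LIKE"] = 0.5 - 0.3 * df["TARGET"] + rng.normal(0, 0.1, 200)
df["NOISE"] = rng.normal(size=200)

# Rank features by absolute correlation with the target.
corr = df.corr()["TARGET"].drop("TARGET")
print(corr.abs().sort_values(ascending=False))
```

The same ranking, applied to the real merged table, drives the choice of which features to inspect with density plots.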
Please refer to the Exploratory Data Analysis section.
Feature engineering is a crucial part of machine learning and requires expert domain knowledge to produce a high-quality model. New features were created using aggregation functions including min, max, mean, sum, and count; this was done for all of the input files. As part of this process, the new features from the secondary files were merged into the primary table application_train, resulting in a set of 170 features. The top 20 most highly correlated features (positive and negative) were then chosen and split into numerical and categorical variables to form the inputs for two individual pipelines. Some of the engineered features were: prevApps_AMT_APPLICATION_max, bureau_AMT_CREDIT_SUM_max, ef_INCOME_CREDIT_PERCENT, ef_FAM_MEMBER_INCOME, and ef_ANN_INCOME_PERCENT.
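The aggregate-then-merge step can be sketched as below. The miniature bureau frame is a stand-in; the naming convention (`bureau_AMT_CREDIT_SUM_max`, etc.) follows the feature names listed above.

```python
import pandas as pd

# Toy bureau table: multiple previous credits per applicant.
bureau = pd.DataFrame({"SK_ID_CURR": [1001, 1001, 1002],
                       "AMT_CREDIT_SUM": [91323.0, 22500.0, 464500.5]})

# Aggregate secondary-table rows down to one row per applicant.
agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
             .agg(["min", "max", "mean", "sum", "count"]))
agg.columns = ["bureau_AMT_CREDIT_SUM_" + c for c in agg.columns]

# Merge the new features back onto the primary table.
app = pd.DataFrame({"SK_ID_CURR": [1001, 1002]})
merged = app.merge(agg.reset_index(), on="SK_ID_CURR", how="left")
print(merged)
```

Repeating this pattern for every secondary table, then concatenating the resulting columns, is what grows the primary table to the 170-feature set.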
(Please see the sections Correlation Map of Numerical Variables, Correlation Analysis, Add New Features, and Feature Aggregator.)
A logistic regression model is used as the baseline model, since it is easy to implement yet provides good efficiency, and training it does not require high computational power. We also tuned the regularization penalty, tolerance, and C hyperparameters for the logistic regression model and compared the results with the baseline. We used 15-fold cross-validation for hyperparameter tuning, applying the GridSearchCV function in scikit-learn.
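A minimal sketch of this tuning setup, using synthetic data and a reduced fold count for speed (the report used 15 folds); the parameter grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared HCDR feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scale + logistic regression, tuning C over a small illustrative grid.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring with `roc_auc` rather than accuracy matters here, since accuracy is misleading on the imbalanced HCDR target.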
Below is the workflow for the model pipeline.
Below is the results table for the three baseline experiments that we performed on the given dataset. Please refer to the Final results section.
In the first experiment the data was imbalanced, which resulted in a high training accuracy of around 91.9% and a test accuracy of 91.85%. After generating the confusion matrix, we could see that the high accuracy was due to the very low number of samples in the default class. Accuracy is not an appropriate metric for imbalanced data, which led us to balance the data and test again.
The misleading results on imbalanced data led us to resample the minority class so that the two classes were evened out, yielding a more reliable accuracy. With the balanced dataset, the training accuracy came out to 68.45% and the test accuracy to 68.68%. The test AUC improved slightly over the previous model, to about 73.69%.
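One way to even out the classes, sketched below with `sklearn.utils.resample`, is to upsample the minority class with replacement; the toy 90/10 split is an illustrative assumption, not the dataset's actual class ratio.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 90 non-default rows, 10 default rows.
df = pd.DataFrame({"TARGET": [0] * 90 + [1] * 10, "x": range(100)})

majority = df[df["TARGET"] == 0]
minority = df[df["TARGET"] == 1]

# Upsample the minority class with replacement to match the majority.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["TARGET"].value_counts())
```

Downsampling the majority class is the mirror-image alternative; upsampling keeps all majority rows at the cost of duplicated minority rows.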
Our final baseline is based on the best hyperparameters for logistic regression: the training accuracy came out to around 68.4% and the training AUC score to 74.81%. We then ran the final model on a full one-third of the training data and obtained a training accuracy of 68.58% and a training AUC score of 74.3%.
We used the baseline model with the best hyperparameters for the Kaggle submission, as it had the best test accuracy and AUC.
In the HCDR project, using Home Credit's data to better predict loan repayment by customers with little or no credit history has real-world impact. Providing credit to people who are creditworthy and in need helps improve their livelihoods.
In this phase of our project we have built an environment to ingest Home Credit's data, analyze the dataset features, transform the data and engineer features, and test machine learning algorithms. Our workflow allows us to obtain a tuned algorithm with a statistically significant increase in performance compared to untuned logistic regression.
During phase 1 we tested a tuned logistic regression on the balanced dataset. Our results are promising, as the improvement from tuning was statistically significant (p-value = 0.003); however, we believe that in future phases we will obtain models with better performance by engineering additional features and testing additional tuned algorithms.
In this phase, simple feature extraction and engineering was implemented, considering the individual tables and their correlations with the target. The problems encountered, apart from the accuracy of the base model, include:
Below is a screenshot of our best Kaggle submission.
Some of the material in this notebook has been adapted from the following:
Read the following:
The assumptions made in Phase 0 were that the data was balanced and that there would be few anomalies, and that the accuracy of the baseline model, logistic regression, would be sufficient to handle the loaded data.
In the next phase we plan to mitigate these problems by using more aggregated features, backed by extensive EDA. Performance can be boosted using GPUs, and PyTorch can be explored to leverage higher compute power. We also plan to adopt optimized coding techniques for better performance. Additionally, GitHub can be used for collaboration, enabling better versioning and merging of the notebooks.